A METHOD FOR ESTIMATING PROPORTIONS


L. F. Guseman, Jr. and Bruce P. Marion


1. Introduction


Let $(\Omega, \mathcal{A}, P)$ be a probability space, and suppose that

\[ \Omega = \bigcup_{j=1}^{m} \Pi_j , \]

where each $\Pi_i \cap \Pi_j = \emptyset$, $i \neq j$, and the unknown a priori
probabilities $\alpha_j = P(\Pi_j)$ are positive. Let $X : \Omega \to \mathbb{R}^n$
be a random vector with conditional density functions $f_j$, $1 \le j \le m$, and
mixture density

\[ f = \sum_{j=1}^{m} \alpha_j f_j . \]

Suppose we are given a classification procedure defined by regions $R_i$,
$1 \le i \le m$, (which partition $\mathbb{R}^n$) and a decision function $c$
defined for $\omega \in \Omega$ by

\[ c(\omega) = i \quad \text{iff} \quad X(\omega) \in R_i . \]


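Then the probability that $\omega \in \Omega$ is classified as belonging to
$\Pi_i$ is given by

\[
\begin{aligned}
P([X \in R_i]) &= P\Big([X \in R_i] \cap \Big(\bigcup_{j=1}^{m} \Pi_j\Big)\Big) \\
  &= P\Big(\bigcup_{j=1}^{m} \big([X \in R_i] \cap \Pi_j\big)\Big) \\
  &= \sum_{j=1}^{m} P([X \in R_i] \cap \Pi_j) \\
  &= \sum_{j=1}^{m} P([X \in R_i] \mid \Pi_j)\, P(\Pi_j) \\
  &= \sum_{j=1}^{m} \alpha_j\, P([X \in R_i] \mid \Pi_j) .
\end{aligned}
\]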




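Let $Y_i = \chi_{R_i} \circ X$, where $\chi_{R_i}$ denotes the characteristic
function of the set $R_i \subset \mathbb{R}^n$, $1 \le i \le m$. Then

\[
\begin{aligned}
E(Y_i) &= E(\chi_{R_i}(X)) \\
  &= \int_{\mathbb{R}^n} \chi_{R_i}(x)\, f(x)\, dx \\
  &= \int_{R_i} f(x)\, dx \\
  &= \int_{R_i} \sum_{j=1}^{m} \alpha_j f_j(x)\, dx \\
  &= \sum_{j=1}^{m} \alpha_j \int_{R_i} f_j(x)\, dx \\
  &= \sum_{j=1}^{m} \alpha_j\, P([X \in R_i] \mid \Pi_j) .
\end{aligned}
\]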
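As a numerical illustration of the identity just derived, the following sketch simulates a hypothetical two-class, one-dimensional normal mixture and compares the observed fraction of draws falling in $R_1$ with $\sum_j \alpha_j P([X \in R_1] \mid \Pi_j)$. The priors, means, and threshold are illustrative assumptions, not values from this paper.

```python
import math
import numpy as np

def normal_cdf(x, mean, std):
    """CDF of a univariate normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# Hypothetical two-class problem (all values are illustrative assumptions).
alpha = np.array([0.3, 0.7])     # a priori probabilities alpha_j
means = np.array([-1.0, 1.0])    # class means, unit variance for both classes
threshold = 0.0                  # R_1 = (-inf, 0], R_2 = (0, inf)

# P([X in R_1] | Pi_j) for each class j
p_row1 = np.array([normal_cdf(threshold, m, 1.0) for m in means])
expected_y1 = float(alpha @ p_row1)          # sum_j alpha_j * p_1j

# Monte Carlo estimate of E(Y_1): fraction of mixture draws landing in R_1
rng = np.random.default_rng(0)
n = 200_000
labels = rng.choice(2, size=n, p=alpha)      # which population each draw is from
x = rng.normal(means[labels], 1.0)
empirical_y1 = float(np.mean(x <= threshold))
```

With a sample this large the two quantities should agree to about two decimal places.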


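Let $\omega^N = (\omega_1, \omega_2, \ldots, \omega_N)$ be a random sample of
size $N$ from $\Omega$. For a given $i$, $1 \le i \le m$, let

\[ Y_{ik}(\omega^N) = Y_i(\omega_k), \quad 1 \le k \le N . \]

Then for fixed $i$, the $Y_{ik}$ are independent random variables and each
has the same distribution as $Y_i$ ([7]); that is, $E(Y_{ik}) = E(Y_i)$,
$1 \le k \le N$. Letting

\[ \hat{e}_i = \frac{1}{N} \sum_{k=1}^{N} Y_{ik} , \]

we have $\hat{e}_i(\omega^N) = N_i / N$, where $N_i$ is the number of elements
in $\omega^N$ that are classified as being from $\Pi_i$. If
$e_i = E(\hat{e}_i)$, then

\[
\begin{aligned}
e_i = E(\hat{e}_i) &= E\Big(\frac{1}{N} \sum_{k=1}^{N} Y_{ik}\Big)
  = \frac{1}{N} \sum_{k=1}^{N} E(Y_{ik})
  = \frac{1}{N} \sum_{k=1}^{N} E(Y_i) \\
  &= E(Y_i) = \sum_{j=1}^{m} \alpha_j\, P([X \in R_i] \mid \Pi_j) .
\end{aligned}
\]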


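Letting

\[
\alpha = (\alpha_1, \ldots, \alpha_m)^\top , \quad
\hat{e} = (\hat{e}_1, \ldots, \hat{e}_m)^\top , \quad
e = (e_1, \ldots, e_m)^\top ,
\]

we have $e = E(\hat{e}) = P\alpha$, where $P$ is an $m \times m$ matrix whose
entry in the $i$th row and $j$th column is given by

\[ p_{ij} = P([X \in R_i] \mid \Pi_j) = \int_{R_i} f_j(x)\, dx , \quad i,j = 1,2,\ldots,m . \]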
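The relation between class labels, the classified fractions, and the error matrix can be sketched in a few lines of Python. The helper name `classified_fractions`, the 2 x 2 matrix, and the label list are hypothetical and serve only to illustrate the definitions above.

```python
import numpy as np

def classified_fractions(labels, m):
    """Return e_hat: the fraction N_i / N of sample elements assigned to class i."""
    counts = np.bincount(np.asarray(labels), minlength=m)   # N_i for i = 0..m-1
    return counts / counts.sum()

# Hypothetical error matrix: column j holds P(classified as i | class j),
# so every column sums to one.
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
alpha = np.array([0.6, 0.4])      # assumed true proportions
e = P @ alpha                     # expected classified fractions, e = P alpha

labels = [0, 0, 1, 0, 1, 1, 0, 0]             # hypothetical classifier output
e_hat = classified_fractions(labels, m=2)     # observed fractions N_i / N
```

Here `e` differs from `alpha`, which is the bias of the raw classified fractions described in the text.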

We note that a classification procedure produces an estimate
$\hat{e}_i(\omega^N) = N_i / N$ of $\alpha_i$ which is biased whenever
$e_i = E(\hat{e}_i) \neq \alpha_i$, $1 \le i \le m$. The equation
$e = E(\hat{e}) = P\alpha$ holds for the error matrix $P$ associated with the
classification procedure used to determine $\hat{e}$ from a given sample.
Consequently, an estimate of $\alpha$ could be given by a solution
$\hat{\alpha}$ of the following problem:

(*)   minimize $\| P\alpha - \hat{e} \|$ (Euclidean norm)

      subject to $\sum_{i=1}^{m} \alpha_i = 1$, $\alpha_i \ge 0$, $1 \le i \le m$.

If $P$ is invertible, then $\tilde{\alpha} = P^{-1} \hat{e}$ is an unbiased
estimate of $\alpha$; that is,

\[ E(\tilde{\alpha}) = E(P^{-1} \hat{e}) = P^{-1} E(\hat{e}) = P^{-1} P \alpha = \alpha . \]

However, simple examples show that even in this case
$\tilde{\alpha} = P^{-1} \hat{e}$ need not satisfy the nonnegativity
constraints even though $\sum_{i=1}^{m} \tilde{\alpha}_i = 1$.
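A small example of this behaviour, with a hypothetical 2 x 2 error matrix and classified sample (neither taken from this paper), might look as follows. Because each column of the matrix sums to one, the components of the inverse estimate sum to the same total as the classified fractions, yet one component comes out negative.

```python
import numpy as np

# Hypothetical error matrix (columns sum to one) and classified fractions.
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
e_hat = np.array([0.15, 0.85])

# Unconstrained inverse estimate P^{-1} e_hat
alpha_tilde = np.linalg.solve(P, e_hat)
```

The first classified fraction, 0.15, is smaller than the smallest value that $P\alpha$ can produce in that component for any admissible $\alpha$ (namely 0.2), so the inverse estimate is forced outside the constraint set: it is approximately (-0.0714, 1.0714).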

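For a given $P$ and $\hat{e}$, problem (*) above reduces to the following
quadratic programming problem:

(**)  minimize the convex functional

      \[ T(\alpha) = \tfrac{1}{2}\, \alpha^\top P^\top P\, \alpha - \hat{e}^\top P\, \alpha \]

      over the constraint set

      \[ S = \Big\{ \alpha = (\alpha_1, \ldots, \alpha_m)^\top :
           \sum_{i=1}^{m} \alpha_i = 1,\ \alpha_i \ge 0,\ 1 \le i \le m \Big\} . \]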

The functional $T$ is convex (since $P^\top P$ is positive semi-definite) and
continuous. Since $S$ is compact and convex, a solution always exists.
When $P$ is invertible, $P^\top P$ is positive definite and $T$ is strictly
convex, so that the solution is unique. The above results on convexity of
$T$ and uniqueness of the solution can be found in [3].
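The step from problem (*) to the quadratic program (**) can be made explicit by expanding the squared Euclidean norm. The expansion below uses only the symbols already defined in this section.

```latex
\[
\tfrac{1}{2}\,\| P\alpha - \hat{e} \|^{2}
  = \tfrac{1}{2}\,\alpha^{\top} P^{\top} P\,\alpha
    - \hat{e}^{\top} P\,\alpha
    + \tfrac{1}{2}\,\hat{e}^{\top}\hat{e}
  = T(\alpha) + \tfrac{1}{2}\,\hat{e}^{\top}\hat{e} .
\]
```

The last term does not depend on $\alpha$, and squaring is monotone on nonnegative numbers, so the norm in (*) and the functional $T$ have the same minimizers over $S$. The Hessian of $T$ is $P^\top P$, which is why its definiteness governs convexity and uniqueness.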
2. A Method For Computing $P$ and $\hat{e}$

Suppose that each conditional density for the random vector $X$ is
multivariate normal with known (or estimated) mean vector $\mu_i$ and
covariance matrix $\Sigma_i$, $1 \le i \le m$; that is,
$f_i(x) \sim N(\mu_i, \Sigma_i)$, $1 \le i \le m$. Under the assumption of
equal a priori probabilities (i.e. $\alpha_0 = (1/m, \ldots, 1/m)^\top$),
there exists (see [4]) a $1 \times n$ vector $B_0$ of norm one such that

\[ g(B_0) = \min_{B} g(B) , \]

where

\[ g(B) = 1 - \frac{1}{m} \int_{\mathbb{R}} \max_{1 \le i \le m} f_i(y, B)\, dy \]

and $f_i(y, B) \sim N(B\mu_i, B\Sigma_i B^\top)$, $1 \le i \le m$. Then the
entries in $P = (p_{ij})$ can be readily computed using the expressions

\[ p_{ij} = \int_{R_i(B_0)} f_j(y, B_0)\, dy , \quad i,j = 1,2,\ldots,m , \]

where

\[ R_i(B_0) = \Big\{ y \in \mathbb{R} : f_i(y, B_0) = \max_{1 \le j \le m} f_j(y, B_0) \Big\} , \quad 1 \le i \le m . \]

Classifying the sample $\omega^N = (\omega_1, \ldots, \omega_N)$ according to
the rule

\[ c(\omega) = i \quad \text{if and only if} \quad B_0(X(\omega)) \in R_i(B_0) \]

produces the values $N_i$ and hence $\hat{e}_i$, $1 \le i \le m$.
The minimizing vector $B_0$, decision regions $R_i(B_0)$, $1 \le i \le m$,
and error matrix $P = (p_{ij})$ can be computed using the program LFSPMC
described in [3]. The sample $\omega^N = (\omega_1, \ldots, \omega_N)$ is then
classified according to the above rule to produce $\hat{e}$ using the
classification capability of LFSPMC.
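The computation of the entries of $P$ for a given projection vector can be sketched as follows. This is not LFSPMC: the vector `B` is simply taken as given (no minimization of $g$ is attempted), the integrals are approximated by a Riemann sum on a uniform grid, and the function names, grid parameters, and the two-class example are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def error_matrix(B, means, covs, num_points=200_001, width=10.0):
    """Approximate p_ij = integral over R_i(B) of f_j(y, B) dy on a uniform grid.

    B is a length-n projection vector; means[j] and covs[j] are the mean
    vector and covariance matrix of class j.
    """
    B = np.asarray(B, dtype=float)
    proj_means = np.array([B @ mu for mu in means])       # B mu_j
    proj_vars = np.array([B @ S @ B for S in covs])       # B Sigma_j B^T
    spread = width * np.sqrt(proj_vars.max())
    y = np.linspace(proj_means.min() - spread,
                    proj_means.max() + spread, num_points)
    dy = y[1] - y[0]
    dens = np.array([normal_pdf(y, mu, v)
                     for mu, v in zip(proj_means, proj_vars)])
    winner = dens.argmax(axis=0)      # index i of the region R_i containing each y
    m = len(proj_means)
    P = np.zeros((m, m))
    for i in range(m):
        # Row i: Riemann sum of every class density f_j over the region R_i
        P[i] = dens[:, winner == i].sum(axis=1) * dy
    return P

# Hypothetical two-class example in two dimensions.
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
B = np.array([1.0, 0.0])
P = error_matrix(B, means, covs)
```

In this example the projected densities are unit-variance normals centred at 0 and 3, so the decision boundary lies at 1.5 and each column of `P` should sum to approximately one.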


3. Preliminary Numerical Results

The data for the numerical results presented in this section
consisted of 30 sets of training statistics and a sample of 16-
dimensional vectors of size 8400 obtained from four registered passes
(May 5, May 23, June 11, June 29, 1973) of LANDSAT 1 MSS measurements
acquired over a 14 square mile test site in Hill County (N), Montana
(see [1]). For all runs made the error matrix $P$ was determined from the
first of the 30 sets of training statistics provided. A subsample of
size 2417 of the original sample was used to compute $\hat{e}$ using the
classification procedure which gave rise to $P$. The sample of size
2417 was made up of vectors from the following five classes:

Wheat (784), Fallow (744), Barley (300), Grass (206), and Stubble
(383).
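The true proportions against which the estimates are later compared follow directly from these class counts. A short sketch of the arithmetic (variable names are illustrative):

```python
# Class counts of the subsample of size 2417, as listed above.
counts = {"Wheat": 784, "Fallow": 744, "Barley": 300, "Grass": 206, "Stubble": 383}

total = sum(counts.values())
true_proportions = {name: n / total for name, n in counts.items()}

# Two-class case (Wheat, Barley): proportions within those two classes only.
two_class_total = counts["Wheat"] + counts["Barley"]
two_class_proportions = {name: counts[name] / two_class_total
                         for name in ("Wheat", "Barley")}
```

These values agree with the true proportions tabulated in Tables 1 and 2 to within rounding in the third decimal place.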

Three runs were made using all five classes. Run 1 used LFSPMC
and the training statistics from the three registered passes of
May 23, June 11, and June 29 to determine $P$ and $\hat{e}$. Run 2 used
LFSPMC and the training statistics from the pass of June 11 to determine
$P$ and $\hat{e}$. For purpose of comparison, Run 3 used an estimated error
matrix determined from a maximum likelihood classification of 12-
dimensional vectors randomly generated using the training statistics
for the aforementioned three registered passes. The same classifier
was used to determine $\hat{e}$ from the sample of size 2417.

Additional runs were made for the two class case (Wheat, Barley)
by using LFSPMC to determine $P$ and $\hat{e}$ from three passes (Run 4) and
one pass (Run 5).

For a given $P$ and $\hat{e}$, two quadratic programming algorithms were
used to solve problem (**) of the previous section. An algorithm
based on the complementary pivot method of Lemke (see [6]) was
employed for the case of nonsingular $P$. In the case where no unique
minimum exists (i.e. $P$ singular), a modification of the Frank-Wolfe
algorithm [2] due to B. Charles Peters, Jr. was used. The results
of the runs are summarized in Tables 1 and 2.
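For orientation, a plain Frank-Wolfe iteration for problem (**) can be sketched as below. This is the textbook method with an exact line search, not the modification due to Peters used for the runs reported here, and it is not Lemke's complementary pivot method. The function name, iteration limit, and the 2 x 2 test data are assumptions.

```python
import numpy as np

def frank_wolfe_simplex(P, e_hat, iterations=5000):
    """Minimize T(a) = 0.5 a^T P^T P a - e_hat^T P a over the probability simplex."""
    P = np.asarray(P, dtype=float)
    e_hat = np.asarray(e_hat, dtype=float)
    m = P.shape[1]
    a = np.full(m, 1.0 / m)                   # start at the centre of the simplex
    for _ in range(iterations):
        grad = P.T @ (P @ a - e_hat)          # gradient of T at a
        vertex = np.zeros(m)
        vertex[np.argmin(grad)] = 1.0         # vertex minimizing the linearized T
        direction = vertex - a
        curvature = np.sum((P @ direction) ** 2)
        if curvature == 0.0:
            break
        # Exact line search for the quadratic along direction, clipped to [0, 1]
        step = np.clip(-(grad @ direction) / curvature, 0.0, 1.0)
        if step == 0.0:
            break
        a = a + step * direction
    return a

# Hypothetical data: the unconstrained inverse estimate here has a negative component.
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
boundary_solution = frank_wolfe_simplex(P, np.array([0.15, 0.85]))
interior_solution = frank_wolfe_simplex(P, P @ np.array([0.6, 0.4]))
```

Every iterate is a convex combination of simplex vertices, so the constraints of (**) hold by construction. The slow convergence noted in Section 4 is a known trait of this family of methods when the solution lies on the boundary in higher dimensions.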


ERROR MATRIX

\[
P = \begin{pmatrix}
.738 & .003 & .184 & .088 & .018 \\
.003 & .625 & .000 & .145 & .444 \\
.113 & .000 & .809 & .000 & .000 \\
.146 & .206 & .007 & .767 & .192 \\
.000 & .166 & .000 & .000 & .347
\end{pmatrix}
\]

CLASSIFIED SAMPLE

$\hat{e} = (.288, .264, .137, .189, .121)^\top$

ESTIMATED PROPORTIONS

$\hat{\alpha} = (.347, .243, .121, .056, .233)^\top$

RUN 1: Five Classes - Three Pass Case
$P$ and $\hat{e}$ Determined By LFSPMC
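The Run 1 figures can be checked for internal consistency: if the estimate lies in the interior of the constraint set, then $P\hat{\alpha}$ should reproduce $\hat{e}$ up to rounding, and each column of $P$ should sum to one. The sketch below uses the Run 1 values with the matrix read row by row; variable names are illustrative.

```python
import numpy as np

# Run 1 error matrix, classified sample, and estimated proportions.
P_run1 = np.array([
    [0.738, 0.003, 0.184, 0.088, 0.018],
    [0.003, 0.625, 0.000, 0.145, 0.444],
    [0.113, 0.000, 0.809, 0.000, 0.000],
    [0.146, 0.206, 0.007, 0.767, 0.192],
    [0.000, 0.166, 0.000, 0.000, 0.347],
])
e_hat = np.array([0.288, 0.264, 0.137, 0.189, 0.121])
alpha_hat = np.array([0.347, 0.243, 0.121, 0.056, 0.233])

residual = P_run1 @ alpha_hat - e_hat    # should vanish up to rounding
column_sums = P_run1.sum(axis=0)         # each column is a conditional distribution
```

Both checks hold to within the three-decimal rounding of the printed values.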
ERROR MATRIX



CLASSIFIED SAMPLE

$\hat{e} = (.332, .357, .098, .183, .030)^\top$

ESTIMATED PROPORTIONS

$\hat{\alpha} = (.376, .465, .084, .076, .000)^\top$


RUN 2: Five Classes - One Pass Case

$P$ and $\hat{e}$ Determined By LFSPMC




ERROR MATRIX

\[
P = \begin{pmatrix}
.965 & .000 & .025 & .005 & .000 \\
.000 & .910 & .000 & .000 & .075 \\
.015 & .015 & .975 & .000 & .000 \\
.010 & .005 & .000 & .970 & .000 \\
.010 & .070 & .000 & .025 & .925
\end{pmatrix}
\]

CLASSIFIED SAMPLE

$\hat{e} = (.316, .271, .142, .080, .192)^\top$

ESTIMATED PROPORTIONS

$\hat{\alpha} = (.324, .283, .135, .077, .180)^\top$


RUN 3: Five Classes - Three Pass Case

Maximum Likelihood Classifier To Determine $\hat{e}$ And Estimate $P$
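For Run 3 the error matrix is close to the identity, so the unconstrained inverse estimate $P^{-1}\hat{e}$ can be expected to be nonnegative already and to coincide with the solution of (**). The following sketch, using the Run 3 values with the matrix read row by row, solves the linear system directly; it is an illustration, not the complementary pivot computation used for the run.

```python
import numpy as np

# Run 3 error matrix and classified sample.
P_run3 = np.array([
    [0.965, 0.000, 0.025, 0.005, 0.000],
    [0.000, 0.910, 0.000, 0.000, 0.075],
    [0.015, 0.015, 0.975, 0.000, 0.000],
    [0.010, 0.005, 0.000, 0.970, 0.000],
    [0.010, 0.070, 0.000, 0.025, 0.925],
])
e_hat = np.array([0.316, 0.271, 0.142, 0.080, 0.192])
reported = np.array([0.324, 0.283, 0.135, 0.077, 0.180])   # estimate printed for Run 3

alpha_tilde = np.linalg.solve(P_run3, e_hat)               # P^{-1} e_hat
```

The solved vector is nonnegative and agrees with the reported estimate to within a few thousandths, the discrepancy being attributable to the three-decimal rounding of the printed inputs.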





ERROR MATRIX        CLASSIFIED SAMPLE



ESTIMATED PROPORTIONS

RUN 4: Two Classes - Three Pass Case
$P$ and $\hat{e}$ Determined By LFSPMC


ERROR MATRIX        CLASSIFIED SAMPLE



ESTIMATED PROPORTIONS



RUN 5: Two Classes - One Pass Case

$P$ and $\hat{e}$ Determined By LFSPMC


              True          Estimated    Three    One
              Proportions   P-matrix     Pass     Pass

    Wheat     .324          .324         .347     .376
    Fallow    .308          .283         .243     .465
    Barley    .124          .135         .121     .084
    Grass     .085          .077         .056     .076
    Stubble   .159          .180         .233     .000

    Table 1. Estimated Proportions: Five Classes


              True          Three    One
              Proportions   Pass     Pass

    Wheat     .723          .714     .716
    Barley    .277          .286     .284

    Table 2. Estimated Proportions: Two Classes
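One simple way to compare the five-class columns of Table 1 is the largest absolute deviation from the true proportions. The sketch below computes it for each column; the choice of this error measure is an assumption made for illustration and is not part of the paper.

```python
import numpy as np

# Table 1, five classes in the order Wheat, Fallow, Barley, Grass, Stubble.
true = np.array([0.324, 0.308, 0.124, 0.085, 0.159])
estimates = {
    "estimated P-matrix (Run 3)": np.array([0.324, 0.283, 0.135, 0.077, 0.180]),
    "three pass (Run 1)": np.array([0.347, 0.243, 0.121, 0.056, 0.233]),
    "one pass (Run 2)": np.array([0.376, 0.465, 0.084, 0.076, 0.000]),
}

# Largest absolute deviation from the true proportions, per column of Table 1.
max_abs_error = {name: float(np.max(np.abs(est - true)))
                 for name, est in estimates.items()}
```

By this measure the column based on the estimated error matrix is closest to the true proportions, followed by the three pass and then the one pass LFSPMC columns.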
4. Remarks

The proportion estimation procedure presented in the previous
sections has the advantage that the error matrix is determined by the
training statistics and thereby requires only one set of ground
truth. In addition, the error matrix is the error matrix for the
classification procedure used to determine $\hat{e}$. It has the
disadvantage that the training statistics must be representative of the
mean vectors and covariance matrices for the populations from which the
sample was made.

The error matrix is directly related to the probability of
misclassification and should be more diagonally dominant with the
increase in number of passes used. It should also be mentioned
that, under the assumptions of distinct classes and equal a priori
probabilities, the error matrix computed by LFSPMC should (barring
numerical difficulties) always be nonsingular.
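Diagonal dominance and nonsingularity of a given error matrix are easy to inspect numerically. The sketch below applies a hypothetical helper, `column_dominance_margin`, to the Run 1 and Run 3 error matrices of Section 3 (read row by row) and also evaluates their determinants. Column-wise dominance is used here because each column of $P$ is a conditional distribution; that choice is an assumption.

```python
import numpy as np

def column_dominance_margin(P):
    """For each column: diagonal entry minus the sum of its off-diagonal entries."""
    P = np.asarray(P, dtype=float)
    diag = np.diag(P)
    return diag - (P.sum(axis=0) - diag)

P_run1 = np.array([
    [0.738, 0.003, 0.184, 0.088, 0.018],
    [0.003, 0.625, 0.000, 0.145, 0.444],
    [0.113, 0.000, 0.809, 0.000, 0.000],
    [0.146, 0.206, 0.007, 0.767, 0.192],
    [0.000, 0.166, 0.000, 0.000, 0.347],
])
P_run3 = np.array([
    [0.965, 0.000, 0.025, 0.005, 0.000],
    [0.000, 0.910, 0.000, 0.000, 0.075],
    [0.015, 0.015, 0.975, 0.000, 0.000],
    [0.010, 0.005, 0.000, 0.970, 0.000],
    [0.010, 0.070, 0.000, 0.025, 0.925],
])

margin_run1 = column_dominance_margin(P_run1)
margin_run3 = column_dominance_margin(P_run3)
det_run1 = np.linalg.det(P_run1)
det_run3 = np.linalg.det(P_run3)
```

A positive margin in every column means the matrix is strictly diagonally dominant by columns, which in turn guarantees nonsingularity. A negative margin, as in the last column of the Run 1 matrix, does not by itself imply singularity, so the determinant is checked separately.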

Both of the quadratic programming algorithms used were essentially
off-the-shelf programs and require some refinements. The complementary
pivot algorithm failed to always meet the problem constraint,
$\sum_{i=1}^{m} \hat{\alpha}_i = 1$, to within machine accuracy, and the
modified Frank-Wolfe algorithm proved to converge slowly. In any event,
the determination of $P$ and $\hat{e}$ using LFSPMC, and subsequent
determination of $\hat{\alpha}$, was always accomplished in less than two
minutes for the runs reported here. Investigations into the development
of more accurate and efficient quadratic programming algorithms are
underway.
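One generic way to restore the constraints on a computed vector that violates them slightly is a Euclidean projection onto the constraint set $S$. The sketch below uses the standard sort-based projection onto the probability simplex. Nothing in this paper indicates that such a step was applied to the reported results; the function name and the test vectors are illustrative assumptions.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {a : sum(a) = 1, a >= 0}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                        # components in decreasing order
    cumulative = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    # Largest index for which the shifted component is still positive
    rho = np.nonzero(u - cumulative / k > 0)[0][-1]
    theta = cumulative[rho] / (rho + 1)         # common shift for active components
    return np.maximum(v - theta, 0.0)

# A vector whose components sum to slightly more than one.
nearly_feasible = np.array([0.376, 0.465, 0.084, 0.076, 0.000])
projected = project_to_simplex(nearly_feasible)

# A vector with a negative component whose components still sum to one.
projected_negative = project_to_simplex(np.array([-0.0714, 1.0714]))
```

The projection returns the point of $S$ nearest to its input, which is in general not the minimizer of $T$ over $S$, so it is a clean-up step rather than a substitute for solving (**).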
Theoretical investigations are also underway to extend the feature
selection algorithm to the case where the density function for each
population is a convex combination of multivariate normal densities.

The resulting algorithm gives rise to a method for estimating
proportions which involves only two classes; namely wheat and non-wheat.